Design and Evaluation of Approaches for Automatic Chinese Text Categorization
In this paper, we propose and evaluate approaches to categorizing Chinese texts, consisting of term extraction, term selection, term clustering, and text classification. We propose a scalable approach that uses frequency counts to identify the left and right boundaries of possibly significant terms. We use a combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most machine learning algorithms impractical, results of an experiment on a CAN news collection show that our approach can dramatically reduce the dimension to 1,200 while maintaining approximately the same level of classification accuracy. We also studied and compared the performance of three well-known classifiers, the Rocchio linear classifier, the naive Bayes probabilistic classifier, and the k-nearest neighbors (kNN) classifier, when applied to categorizing Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when classifying new texts. Rocchio was very time- and memory-efficient and achieved a high level of accuracy, about 75.4%. In a practical implementation, Rocchio may be a good choice.
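As a rough illustration of the trade-off the abstract describes, here is a minimal, hypothetical sketch (not the paper's implementation) of a positive-only Rocchio centroid classifier next to a cosine-similarity kNN classifier over sparse term-weight vectors; all document contents, term names, and labels below are invented:

```python
from collections import defaultdict
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse term->weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(docs, labels):
    """One summed term-weight vector (centroid) per class; the simplest
    positive-only Rocchio variant (no negative-example term)."""
    cents = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for t, w in doc.items():
            cents[label][t] += w
    return {c: dict(v) for c, v in cents.items()}

def classify_rocchio(cents, doc):
    # Cost grows with the number of classes, not the number of training
    # documents, which is why Rocchio is fast at classification time.
    return max(cents, key=lambda c: cosine(doc, cents[c]))

def classify_knn(docs, labels, doc, k=3):
    # kNN compares the new document against every training document,
    # hence the higher time and memory cost the abstract reports.
    ranked = sorted(zip(docs, labels), key=lambda p: cosine(doc, p[0]),
                    reverse=True)
    votes = defaultdict(int)
    for _, label in ranked[:k]:
        votes[label] += 1
    return max(votes, key=votes.get)
```

The asymmetry is the point: both classifiers index the same vector space, but Rocchio compresses each class to one vector up front, while kNN defers all the work to query time.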
External-Memory Computational Geometry
(c) 1993 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
In this paper we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory. We use these techniques to develop optimal and practical algorithms for a number of important large-scale problems. We discuss our algorithms primarily in the context of single-processor/single-disk machines, a domain in which they are not only the first known optimal results but also of tremendous practical value. Our methods also produce the first known optimal algorithms for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal both in terms of I/O cost and internal computation.
Techniques for solving geometric problems on mesh-connected computers
The contributions of this thesis are twofold: (i) we solve optimally some problems on conventional Mesh-Connected Computers that were not previously solved optimally, and (ii) we present new algorithms for several geometric problems on more realistic models. On conventional Mesh-Connected Computers, in which the n processors are arranged as a (multidimensional) array, we present a new technique for optimally performing n searches on a class of hierarchical DAGs, which leads to the first optimal mesh algorithms for the three-dimensional convex hull and convex polyhedra intersection problems, settling an open problem posed in (AW88) and in (MS88b). The previous algorithms were a log n factor away from optimality. On the more realistic models (RAM/ARRAY(d)), in which the d-dimensional Mesh-Connected Computer has fixed size p and is attached to a random access machine, we present new algorithms for several geometric problems that achieve the same speedup for a problem of arbitrary size n ≥ p as for a problem of size p. The problems include computing the all nearest neighbors of a planar set of points, the measure and perimeter of a union of rectangles, the visibility of a set of nonintersecting line segments from a point, and dominance counting between two planar sets of points. All of these problems have sequential time complexity Θ(n log n) and have O(p^{1/d}) solutions for a problem of size p on a d-dimensional Mesh-Connected Computer with p processors. Hence, the RAM/ARRAY(d) achieves a speedup of O(p^{1-1/d} log p) for a problem of size p. Thus our contribution is to show that the speedup of O(p^{1-1/d} log p) can be achieved for arbitrarily large problem sizes.
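The claimed speedup follows, up to constant factors, from the two stated bounds; a one-line check for a problem of size p:

```latex
\mathrm{speedup}(p)
  \;=\; \frac{T_{\mathrm{seq}}(p)}{T_{\mathrm{par}}(p)}
  \;=\; \frac{\Theta(p \log p)}{O\!\left(p^{1/d}\right)}
  \;=\; O\!\left(p^{\,1-1/d} \log p\right).
```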
Improving Linear Classifier for Chinese Text Categorization
The goal of this paper is to derive extra representatives from each class to compensate for the potential weakness of linear classifiers, which compute only one representative per class. To evaluate the effectiveness of our approach, we compared it with the linear classifier produced by the Rocchio algorithm and with the k-nearest neighbor (kNN) classifier. Experimental results show that our approach improved the linear classifier and achieved micro-averaged accuracy close to that of kNN with much less classification time. Furthermore, identifying new representatives for the linear classifier can also suggest how to reorganize the structure of the classes.
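In the spirit of the abstract's idea, and purely as an invented sketch (the paper's actual method of deriving extra representatives is not given here), one simple way to go beyond one-centroid-per-class is to pool the training documents a class centroid covers poorly into a second representative, then classify against the nearest representative of any class; all names and the similarity threshold are assumptions:

```python
from collections import defaultdict
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse term->weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroids_of(docs, labels):
    """Plain Rocchio-style centroids: one summed term vector per class."""
    cents = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for t, w in doc.items():
            cents[label][t] += w
    return {c: dict(v) for c, v in cents.items()}

def with_extra_representatives(docs, labels, min_sim=0.3):
    """Hypothetical scheme: training documents whose similarity to their
    own class centroid falls below min_sim are pooled into an extra
    representative for that class."""
    cents = centroids_of(docs, labels)
    reps = {c: [v] for c, v in cents.items()}
    pooled = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        if cosine(doc, cents[label]) < min_sim:
            for t, w in doc.items():
                pooled[label][t] += w
    for c, v in pooled.items():
        reps[c].append(dict(v))
    return reps

def classify(reps, doc):
    # A document goes to the class of its single most similar representative,
    # so classification time stays proportional to the (small) number of
    # representatives rather than the training-set size.
    return max(reps, key=lambda c: max(cosine(doc, r) for r in reps[c]))
```

A class containing a minority subtopic (poorly represented by its centroid) then gets a dedicated second representative, which is also the kind of signal one could use to reorganize the class structure.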
Passive Forgery Detection for JPEG Compressed Image based on Block Size Estimation and Consistency Analysis
As most digital cameras and image capture devices do not have modules for embedding a watermark or signature, passive forgery detection, which aims to detect traces of tampering without embedded information, has become a major focus of recent research on JPEG compressed images. However, our investigation shows that current approaches for detecting and localizing tampered areas are very sensitive to image content and suffer from high false detection rates when localizing tampered areas in images with intensive edges and textures. In this paper, we present an effective approach that overcomes this problem, using reliable estimation and analysis of block sizes from the block artifacts produced by the JPEG compression process. We first propose an enhanced cross-difference filter to strengthen block artifacts and reduce interference from edges and textures, and then integrate random sampling, voting, and maximum likelihood techniques to improve the accuracy of block size estimation. We develop two different random sampling strategies for block size estimation: one for estimating the primary JPEG block size, and the other for consistency analysis of local block sizes. Local blocks whose JPEG block sizes differ from the primary block size are classified as tampered. We finally perform a refinement process to eliminate false detections and fill in undetected tampered blocks. Experiments over various tampering methods, such as copy-and-paste, image completion, and composite tampering, show that our approach can effectively detect and localize tampered areas and is not sensitive to image content such as edges and textures.
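As a toy illustration of the boundary-spacing-vote idea (far simpler than the paper's enhanced cross-difference filter, random sampling, and maximum-likelihood machinery), one can accumulate a column profile of horizontal pixel differences, mark columns with unusually strong responses as candidate block boundaries, and take a majority vote over the spacings between consecutive boundaries; the threshold rule and the synthetic blocky image below are assumptions:

```python
from collections import Counter

def estimate_block_size(image):
    """Toy horizontal block-size estimator for blocky (JPEG-like) images.

    image: list of rows, each a list of pixel intensities.
    Returns the most common spacing between strong vertical edges,
    or None if no block structure is detected.
    """
    w = len(image[0])
    # Column profile of absolute horizontal differences: block
    # boundaries produce consistently large values down a column.
    profile = [0.0] * (w - 1)
    for row in image:
        for x in range(w - 1):
            profile[x] += abs(row[x + 1] - row[x])
    # "Well above average" columns are candidate block boundaries
    # (an assumed, deliberately crude threshold).
    threshold = 2 * sum(profile) / len(profile)
    boundaries = [x for x, v in enumerate(profile) if v > threshold]
    # Majority vote over spacings between consecutive boundaries.
    gaps = [b - a for a, b in zip(boundaries, boundaries[1:])]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0][0]
```

Voting on spacings rather than on raw peak positions sidesteps the ambiguity that any multiple of the true period also aligns with the artifact grid, which is one reason spacing- and voting-based estimates are attractive here.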